Record: 0.2292 BPB — Dirichlet-Multinomial Smoothing + Distributed Prefill + 15-Gram + EBLS#796
Open

Robby955 wants to merge 4 commits into openai:main from
Conversation
3-seed validated: s1337=0.6565, s2024=0.6570, s2025=0.6565 (mean 0.6567, std 0.0003). 8xH100 SXM, 560s training + ~300s eval, all artifacts under 16MB. Key innovation: distributed cache pre-fill using pure numpy. Each GPU rank pre-populates its n-gram hash tables with ALL preceding token positions before scoring, producing results mathematically identical to single-GPU sequential evaluation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
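The pre-fill idea can be illustrated with a minimal count-table sketch (function and variable names are mine, not the PR's actual code): a rank that owns only the second half of a sequence first populates its cache with every earlier position, so its subsequent scoring matches a single sequential pass exactly.

```python
import numpy as np

def score_shard(tokens, order, start, end, prefill=True):
    """Hypothetical sketch of distributed cache pre-fill: return, for each
    position in [start, end), how often this position's context has
    predicted the true next token so far."""
    counts = {}
    # Pre-fill: populate the cache with ALL positions before this shard,
    # exactly as a sequential pass would have seen them.
    first = order if prefill else start
    for i in range(first, start):
        ctx = tuple(tokens[i - order:i])
        counts.setdefault(ctx, {})
        counts[ctx][tokens[i]] = counts[ctx].get(tokens[i], 0) + 1
    scores = []
    for i in range(start, end):
        ctx = tuple(tokens[i - order:i])
        scores.append(counts.get(ctx, {}).get(tokens[i], 0))
        counts.setdefault(ctx, {})
        counts[ctx][tokens[i]] = counts[ctx].get(tokens[i], 0) + 1
    return scores

rng = np.random.default_rng(0)
toks = rng.integers(0, 8, size=200).tolist()

# Single-GPU sequential pass over positions [2, 200).
full = score_shard(toks, order=2, start=2, end=200)
# "Rank 1" scores only [100, 200), pre-filled with positions [2, 100):
shard = score_shard(toks, order=2, start=100, end=200)
assert shard == full[98:]  # identical to sequential evaluation
```

Without the pre-fill step (`prefill=False`), the rank starts from an empty cache and its scores diverge from the sequential reference, which is the gap the PR attributes to ~0.31 BPB.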
…ptive gating

3-seed validated (seeds 1337, 2024, 2025, std 0.0003). Reached 0.6567 via two innovations: distributed cache pre-fill (-0.31 BPB) and order-adaptive entropy gating (-0.18 BPB).
nice 🔥🔥🔥🔥
Add complementary training (from @pentxayc openai#803) and per-order multipliers (from @AayushBaniya2006 openai#809) on top of distributed prefill + 15-gram + order-adaptive gating. New 3-seed results: 0.28798 / 0.28804 / 0.28810. All seeds under 16MB, training under 560s, eval under 330s. Updated README with legality hedge, full ablation, credits.
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request on Mar 26, 2026
CRITICAL FIX: Previously each of the 8 GPU ranks only updated its n-gram cache with its own 1/8 of scored windows. Now ALL ranks update with the FULL chunk (same as the mixer already does). PR openai#796 showed this costs ~0.31 BPB: "Without pre-fill, ranks 1-7 start with empty n-gram caches. This costs ~0.31 BPB." Expected: a large improvement from 8x more n-gram data per rank. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request on Mar 26, 2026
Full-chunk n-gram cache sharing: 0.6913 -> 0.5865 (-0.105 BPB) This confirms PR openai#796's finding that rank-local caches lose ~0.1+ BPB. WARNING: artifact=16.25MB (over 16MB limit for this seed). Need to increase pruning from 3% to 4%, or reduce bigram_vocab_size, to ensure all seeds fit. Eval time: 492s (within budget). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request on Mar 26, 2026
…ipliers

Novel improvement over a uniform entropy threshold:
- Per-order entropy center: order 2 → 5.0 (trust only when the model is confused); order max → 2.0 (trust even when the model is OK)
- Per-order alpha multiplier: order 2 → 0.3× (suppress noise); order max → 2.0× (boost precision)
- Linear interpolation between orders for a smooth transition

Inspired by PR openai#796's ablation showing -0.182 BPB from order-adaptive gating alone. Our implementation is continuous (a sigmoid per order) rather than discrete thresholds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
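The gating scheme in that commit can be sketched in a few lines. This is a minimal illustration under my own naming assumptions (`order_gate`, `sharpness`), not the repository's implementation: the entropy center and alpha multiplier are linearly interpolated between order 2 and the max order, then a sigmoid on the neural model's entropy produces a continuous gate.

```python
import numpy as np

def order_gate(order, entropy, max_order=15, sharpness=1.0):
    """Sketch of per-order entropy gating (parameter names assumed):
    interpolate the entropy center and alpha multiplier across orders,
    then gate on the neural model's entropy with a sigmoid."""
    center = np.interp(order, [2, max_order], [5.0, 2.0])
    alpha_mult = np.interp(order, [2, max_order], [0.3, 2.0])
    # Trust this order's n-gram more as entropy exceeds its center.
    gate = 1.0 / (1.0 + np.exp(-sharpness * (entropy - center)))
    return alpha_mult * gate

# Low orders only fire when the model is confused (high entropy)...
assert order_gate(2, entropy=6.0) > order_gate(2, entropy=3.0)
# ...while the max order is trusted even at moderate entropy.
assert order_gate(15, entropy=3.0) > order_gate(2, entropy=3.0)
```

The continuous sigmoid avoids the hard on/off behavior of a discrete threshold, which is the distinction the commit message draws.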
… validated)

Replace per-order multipliers with a recursive Dirichlet posterior predictive. Neural model as informative prior, single concentration c=5.0. 3-seed mean: 0.22923 BPB (std 0.000005). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author

Updated submission: 0.6567 → 0.2880 → 0.2292 BPB (3-seed mean, std 0.000005). Replaced the per-order multipliers with Dirichlet-Multinomial posterior smoothing (single concentration c=5.0). All logs, code, and submission.json updated in the latest commit.
This is one of the cleanest submissions in the competition. Replacing 14 hand-tuned per-order alpha parameters with a single Dirichlet concentration (c=5.0) is elegant: the recursive posterior predictive naturally handles sparsity at high orders without any manual intervention. The math does what entropy thresholds and sigmoid gating are trying to approximate. The 3-seed std of 0.000005 is also remarkable, the tightest we've seen across all submissions. Nice work.
Record: Empirical Bayes N-gram Mixing -- val_bpb=0.2292
What this does
Instead of hand-tuning alpha multipliers for each n-gram order (my previous submission at 0.2880), I replaced the mixing strategy with Bayesian posterior inference.
The formula:

p(w | h_k) = (n(h_k, w) + c · p(w | h_{k-1})) / (n(h_k) + c)

where n(h_k, w) is the count of token w after the order-k context h_k, and n(h_k) is that context's total count. This is the Dirichlet-Multinomial posterior predictive. The neural model is the prior, the n-gram counts are the likelihood, and the concentration c controls the tradeoff. Applied recursively from bigram up to 15-gram, each order's smoothed estimate becomes the next order's prior.

A single global concentration (c=5.0) handles the sparse-count problem that previously required hand-tuned per-order multipliers. The improvement is 0.059 BPB, which I didn't expect from replacing 14 tuned parameters with 1.
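The recursion is short enough to sketch directly. A minimal numpy version, assuming counts are stored as one dense count vector per context order (array layout and function name are my assumptions, not the submission's code):

```python
import numpy as np

def dirichlet_backoff(counts_by_order, neural_probs, c=5.0):
    """Recursive Dirichlet-Multinomial posterior predictive:
    counts_by_order[k] is a (V,) count vector for the order-k context,
    ordered low to high; neural_probs is the neural LM's (V,)
    distribution, used as the base prior."""
    prior = neural_probs
    for cnt in counts_by_order:  # bigram first, highest order last
        # Posterior predictive: pseudo-counts c*prior plus observed counts.
        prior = (cnt + c * prior) / (cnt.sum() + c)
    return prior

neural = np.array([0.4, 0.3, 0.2, 0.1])
counts = [np.array([3.0, 1.0, 0.0, 0.0]),   # order-2 context counts
          np.array([1.0, 0.0, 0.0, 0.0])]   # order-3 context (sparse)
p = dirichlet_backoff(counts, neural, c=5.0)
assert np.isclose(p.sum(), 1.0)
# A sparse high-order count nudges, but does not dominate, the posterior.
```

With few observed counts the output stays close to the lower-order prior, and as counts grow they dominate; that is the automatic sparsity handling the hand-tuned multipliers were approximating.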
Results

3-seed mean: 0.22923 BPB (std 0.000005) on seeds 1337 / 2024 / 2025. Artifact ~14.9 MB, training under 560s, eval under 330s.
Ablation chain

0.6567 (distributed prefill + 15-gram + order-adaptive gating) → 0.2880 (+ complementary training + per-order multipliers) → 0.2292 (per-order multipliers replaced by the Dirichlet posterior, c=5.0, -0.059 BPB).
What's novel
Using a neural LM as the base measure in hierarchical Bayesian n-gram smoothing. Traditional Bayesian LMs (MacKay & Peto 1995, Teh 2006) use uniform or unigram priors. This is the Dirichlet special case (discount=0) of the Pitman-Yor family, a sibling to Kneser-Ney, not a generalization of it.
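The relationship to Pitman-Yor can be checked numerically. A simplified single-parameter Pitman-Yor predictive (using the one-table-per-observed-type approximation, purely for illustration) reduces exactly to the Dirichlet formula when the discount d is zero:

```python
import numpy as np

def pitman_yor_predictive(counts, p0, theta=5.0, d=0.0):
    """Simplified Pitman-Yor predictive (one table per observed type,
    an assumption for illustration, not a full CRP implementation)."""
    t = np.count_nonzero(counts)          # number of distinct types seen
    disc = np.maximum(counts - d, 0.0)    # discounted counts
    return (disc + (theta + d * t) * p0) / (theta + counts.sum())

p0 = np.array([0.4, 0.3, 0.2, 0.1])       # base measure (neural prior)
counts = np.array([3.0, 1.0, 0.0, 0.0])
dirichlet = (counts + 5.0 * p0) / (counts.sum() + 5.0)
# Discount d=0 recovers the Dirichlet-Multinomial special case.
assert np.allclose(pitman_yor_predictive(counts, p0, theta=5.0, d=0.0),
                   dirichlet)
```

A nonzero discount d subtracts mass from each observed type and redistributes it via the base measure, which is the Kneser-Ney-style behavior this submission deliberately does not use.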
What's borrowed
N-gram cache approach from the community (especially @deanbrr, @lukacf, @Asukabot0, @newjordan). Complementary training from @pentxayc. Per-order multiplier concept from @AayushBaniya2006 (now replaced by Dirichlet). The Bayesian smoothing formula itself is textbook.
Compliance
Technical details
- 11-layer transformer (3 shared × 3 loops + 2 unique, EBLS), 512d, 8 heads / 4 KV heads (GQA)
- Complementary n-gram training (alpha=0.5)
- 15-order recursive Bayesian backoff with concentration c=5.0
- int6 GPTQ + LZMA compression; ~14.9 MB artifact
Feedback welcome.